date()
## [1] "Tue Nov 29 02:56:42 2022"
Hello! My name is Ghaida, but you can also call me Gege. I am a first-year master’s student in the University of Helsinki’s Contemporary Societies program.
Click here to visit my Github page!
Click here to see my Learning Diary!
date()
## [1] "Tue Nov 29 02:56:42 2022"
data <- read.csv(file = "learning2014.csv", sep=",", header = TRUE)
dim(data)
## [1] 166 7
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 166 observations and 7 variables, which are:
| Variable name | Description |
|---|---|
| gender | Gender of the student (character, F/M) |
| Age | Age of the students in years (integer) |
| attitude | The average score of the student’s attitude towards statistics (numeric) |
| deep | The student’s average score on deep learning approach (numeric) |
| stra | The student’s average score on strategic learning approach (numeric) |
| surf | The student’s average score on surface learning approach (numeric) |
| Points | The student’s exam points (integer) |
#accessing the libraries
library(ggplot2)
library(GGally)
#drawing scatter plot matrix
ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
The data for female students are shown in pink, while the data for male students are shown in green. The frequency graph shows that we have a lot more data from female students. The students’ age has a highly skewed distribution, which tells us that most of the students are on the younger side, with a few much older students as well. We can see that the students’ attitude towards statistics is moderately positively correlated with their exam points, which is not surprising. An interesting correlation can be seen between the surface and deep learning scores. The two variables are negatively correlated, which means that students who score higher on surface learning tend to score lower on deep learning, and vice versa. However, a moderately strong negative correlation is only observed in male students, not in female students. A similar phenomenon can also be observed between the surface learning score and the students’ attitude towards statistics.
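The gender-specific correlations discussed above can also be computed directly rather than read off the plot; a minimal sketch using dplyr (assuming the same `data` object loaded above):

```r
# correlations mentioned in the text, computed per gender
library(dplyr)

data %>%
  group_by(gender) %>%
  summarise(
    cor_surf_deep     = cor(surf, deep),  # expected to be clearly negative for "M"
    cor_surf_attitude = cor(surf, attitude),
    n                 = n()
  )
```

This reproduces the per-group coefficients that ggpairs() prints in the upper triangle of the matrix.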
summary(data)
## gender Age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Here, we can see the descriptive statistics of each variable in the data. The students’ ages range from 17 to 55 years, with an average of 25.5 years. The students’ attitude towards statistics averages 3.14, which means that the students have a slightly positive attitude towards statistics in general. Among the three learning approaches, deep learning has the highest average score while surface learning has the lowest. Meanwhile, the students’ exam points range from 7 to 33 and average 22.7.
library(tidyverse)
#3 variables as explanatory variables
fit <- data %>%
lm(Points ~ attitude + stra + deep, data = .)
summary(fit)
##
## Call:
## lm(formula = Points ~ attitude + stra + deep, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The low p-value and positive estimate suggest that the students’ attitude towards statistics is significantly and positively related to their exam points. We also found weak evidence of a positive relationship between the students’ strategic learning score and their exam points (p-value = 0.075, significant at the 0.1 level). However, we did not find any evidence of a relationship between the students’ deep learning score and their exam points. Next, we will remove this variable and fit the model again without it.
#remove insignificant variable
fit2 <- data %>%
lm(Points ~ attitude + stra, data = .)
summary(fit2)
##
## Call:
## lm(formula = Points ~ attitude + stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
We obtained results similar to the previous model. With a very low p-value, the students’ attitude towards statistics was still found to be a highly significant predictor of their exam points. The students’ exam score increases by 3.47 points on average with each one-point increase in their attitude score, assuming that their strategic learning score remains constant. Meanwhile, the students’ strategic learning score was found to be significant at the 0.1 level (p-value = 0.089). The students’ exam score increases by 0.91 points on average with each one-point increase in their strategic learning score, assuming that their attitude score remains constant. In other words, students who have a more positive attitude towards statistics and/or practice a strategic learning approach tend to have higher exam points.
The adjusted R-squared of this model is 0.1951, which means that the model explains about 19.5 percent of the variation in the students’ exam points. Considering that this is a very simple model, this is already quite a high R-squared value.
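As a sanity check, the adjusted R-squared can be reproduced by hand from the multiple R-squared, since adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) for n observations and p predictors:

```r
# reproduce the adjusted R-squared of fit2 from the printed summary values
r2 <- 0.2048  # multiple R-squared
n  <- 166     # observations
p  <- 2       # predictors: attitude, stra

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)  # ~0.195, matching summary(fit2) up to rounding
```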
In linear regression, we assume a linear relationship between the variables and normally distributed errors/residuals. We also assume that the variance of the residuals is equal across all predicted values (homoscedasticity). The residuals vs. fitted values plot and the Q-Q plot can be used to check these assumptions.
plot(fit2, which = c(1, 2, 5))
In the residuals vs. fitted values plot, we can see that the residuals seem to be randomly scattered. The plot does not display any concerning patterns, such as a curve (suggesting non-linearity) or a trombone pattern (suggesting heteroscedasticity). Based on this plot, it seems that the data are linear and homoscedastic (the variance of the residuals tends to be equal across all predicted values).
The Q-Q plot compares the standardized residuals to their theoretical quantiles (the values they should have if the normality assumption is fulfilled). If the assumption is fulfilled, the points should fall along the straight line. In this plot, we can see that the points seem to form a slight curve, which means that the distribution of the residuals is actually a bit left-skewed.
The residuals vs. leverage plot is used to check whether there are any outliers that might affect the model heavily. There seem to be several extreme values in the data, but none of them fall outside the Cook’s distance lines, so they are not necessarily considered influential to the model.
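The influence of individual observations can also be checked numerically rather than visually; a sketch (assuming the `fit2` object from above), using the common rule of thumb that a Cook’s distance above 4/n deserves a closer look:

```r
# Cook's distance for each observation in the fitted model
cd <- cooks.distance(fit2)

# rule-of-thumb cutoff: 4 divided by the number of observations
cutoff <- 4 / length(cd)

# indices of observations that may deserve closer inspection
which(cd > cutoff)
```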
date()
## [1] "Tue Nov 29 02:56:53 2022"
data <- read.csv(file = "alc.csv", sep=",", header = TRUE)
#printing out the variables name
colnames(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This week, we are working with the “Student Performance Data Set” from the UCI Machine Learning Repository. The data were collected from school reports and questionnaires to study student performance, and contain 35 variables from 370 students in two Portuguese high schools. The variables in this data include:
| Variable Name | Description |
|---|---|
| school | The student’s school (“GP” (Gabriel Pereira) or “MS” (Mousinho da Silveira)) |
| sex | The student’s sex (“F” or “M”) |
| age | The student’s age (numeric from 15-22) |
| address | The student’s home address type (“U” (urban) or “R” (rural)) |
| famsize | The student’s family size (“LE3” (<=3) or “GT3” (>3)) |
| Pstatus | The parent’s cohabitation status (“T” (living together) or “A” (apart)) |
| Medu | The mother’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education)) |
| Fedu | The father’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education)) |
| Mjob | The mother’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”) |
| Fjob | The father’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”) |
| reason | The reason for choosing the school (close to “home”, school “reputation”, “course” preference, or “other”) |
| guardian | The student’s guardian (“mother”, “father”, or “other”) |
| traveltime | Home to school travel time (“1” (<15 min), “2” (15 to 30 min), “3” (30 min. to 1 hour), or “4” (>1 hour)) |
| studytime | Time spent on studying weekly (“1” (<2 hours), “2” (2 to 5 hours), “3” (5 to 10 hours), or “4” (>10 hours)) |
| failures | Number of past class failures (numeric from 1-4 where 4 includes 4 or more classes) |
| schoolsup | Extra educational support (“yes” or “no”) |
| famsup | Family educational support (“yes” or “no”) |
| paid | Extra paid classes within the course subject (“yes” or “no”) |
| activities | Extracurricular activities (“yes” or “no”) |
| nursery | Attended nursery school (“yes” or “no”) |
| higher | Wants to take higher education (“yes” or “no”) |
| internet | Availability of internet access at home (“yes” or “no”) |
| romantic | In a romantic relationship (“yes” or “no”) |
| famrel | Quality of family relationship (numeric from 1 (very bad) to 5 (excellent)) |
| freetime | Free time after school (numeric from 1 (very low) to 5 (very high)) |
| goout | Going out with friends (numeric from 1 (very low) to 5 (very high)) |
| Dalc | Workday alcohol consumption (numeric from 1 (very low) to 5 (very high)) |
| Walc | Weekend alcohol consumption (numeric from 1 (very low) to 5 (very high)) |
| health | Current health status (numeric from 1 (very bad) to 5 (very good)) |
| absences | Number of school absences (numeric from 0-93) |
| G1 | First period grade (numeric from 0-20) |
| G2 | Second period grade (numeric from 0-20) |
| G3 | Final grade (numeric from 0-20) |
| alc_use | Average alcohol use in both weekend (Walc) and weekdays (Dalc) (between 1-5) |
| high_use | High use of alcohol (“TRUE” (alc_use > 2) or “FALSE” (alc_use <= 2)) |
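The last two rows are derived variables created in the data-wrangling step; their definitions can be sketched as follows (assuming a data frame with the original Dalc and Walc columns):

```r
# sketch of how the derived alcohol-use columns are defined
library(dplyr)

data <- data %>%
  mutate(
    alc_use  = (Dalc + Walc) / 2,  # average of workday and weekend consumption
    high_use = alc_use > 2         # TRUE when the average exceeds 2
  )
```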
- Male students (sex) tend to have higher alcohol consumption (high_use)
- Students with more school absences (absences) tend to have higher alcohol consumption (high_use)
- Students who spend less time studying (studytime) tend to have higher alcohol consumption (high_use)
- Students who go out more (goout) tend to have higher alcohol consumption (high_use)
library(patchwork)
library(tidyverse)
library(finalfit)
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "sex")
## label levels FALSE TRUE
## sex F 154 (59.5) 41 (36.9)
## M 105 (40.5) 70 (63.1)
The cross tabulation above reveals that among the 195 female students in the data, 41 have high alcohol consumption, while among the 175 male students, 70 do. Among students who have high alcohol consumption, 63.1 percent are male. We will next visualize this with plots.
#count plot
t1 <- data %>%
ggplot(aes(x = sex, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Gender (sex)")
t1 <- t1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
t2 <- data %>%
ggplot(aes(x = sex, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Gender (sex)")
t1+t2
From the bar plot above, we can see that even though we have more female students in the data, the number of male students with high alcohol consumption is higher. This is more apparent in the proportion plot, where we can see that the proportion of students with high alcohol consumption is a lot higher in male students compared to female students.
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "absences")
## label levels FALSE TRUE
## absences Mean (SD) 3.7 (4.5) 6.4 (7.1)
Students with high alcohol consumption have around 6.4 absences on average, while students with lower alcohol consumption have 3.7 on average. Next, we will also compare the distributions of the two groups with box plots.
data %>%
ggplot(aes(x = absences, y = high_use)) +
geom_boxplot(aes(colour=high_use))+
labs(x = "Number of school absences (absences)",
y = "High use of alcohol")+
theme(legend.position = "none")
Students with high alcohol consumption have a higher median number of school absences than students with lower use of alcohol. Both groups have quite skewed data, which means that most students have very few absences but some students have a very high number of absences.
For better presentation of the data, I will first recode the variable:
data <- data %>%
mutate(studytime2 = studytime %>% factor() %>%
fct_recode("<2 hours" = "1", "2-5 hours" = "2",
"5-10 hours" = "3", ">10 hours" = "4"))
#cross tabulation
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "studytime2")
## label levels FALSE TRUE
## studytime2 <2 hours 56 (21.6) 42 (37.8)
## 2-5 hours 128 (49.4) 57 (51.4)
## 5-10 hours 52 (20.1) 8 (7.2)
## >10 hours 23 (8.9) 4 (3.6)
Among students who have high alcohol consumption, 51.4 percent spend 2-5 hours per week studying and 37.8 percent spend less than 2 hours. In total, 89.2 percent of students with high alcohol consumption spend 5 hours or less per week studying, which is quite a high number. Next, we will visualize this with plots.
#count plot
f1 <- data %>%
ggplot(aes(x = studytime2, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Time spent on studying weekly")
f1 <- f1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
f2 <- data %>%
ggplot(aes(x = studytime2, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Time spent on studying weekly")
f1+f2
Most students spend 2-5 hours weekly studying. This group also has the highest count of students with high alcohol consumption, but according to the proportion plot, students who spend less than 2 hours studying have the highest proportion of high alcohol consumption. Meanwhile, the proportion of students with high alcohol consumption is not very different between the students who study 5-10 hours/week and those who study more than 10 hours/week.
For better presentation of the data, I will first recode the variable:
data <- data %>%
mutate(goout2 = goout %>% factor() %>%
fct_recode("very rarely" = "1", "rarely" = "2",
"medium" = "3", "often" = "4", "very often" = "5"))
#cross tabulation
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "goout2")
## label levels FALSE TRUE
## goout2 very rarely 19 (7.3) 3 (2.7)
## rarely 82 (31.7) 15 (13.5)
## medium 97 (37.5) 23 (20.7)
## often 40 (15.4) 38 (34.2)
## very often 21 (8.1) 32 (28.8)
Among students with high alcohol consumption, 34.2 percent go out often with their friends and 28.8 percent go out very often. From this table, we can already see that the proportion of students with high alcohol consumption is higher in groups of students who go out more often, but we will check this with plots.
#count plot
g1 <- data %>%
ggplot(aes(x = goout2, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Going out with friends")
g1 <- g1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
g2 <- data %>%
ggplot(aes(x = goout2, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Going out with friends")
g1+g2
In the proportion plot, we can see that the proportion of students with high alcohol consumption gets higher in groups of students who go out more. The difference is subtle between the groups of students who go out less often, but in groups of students who go out often or very often, the proportion is very high.
# find the model with glm()
# I will treat all explanatory variables as factors except for absences
m <- glm(high_use ~ sex + absences + studytime2 + goout2, data = data, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + absences + studytime2 + goout2,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8789 -0.7759 -0.4636 0.7578 2.4917
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.55088 0.72954 -3.497 0.000471 ***
## sexM 0.89483 0.27955 3.201 0.001370 **
## absences 0.07846 0.02312 3.394 0.000688 ***
## studytime22-5 hours -0.40001 0.30551 -1.309 0.190424
## studytime25-10 hours -0.89977 0.48249 -1.865 0.062205 .
## studytime2>10 hours -1.17641 0.63893 -1.841 0.065591 .
## goout2rarely 0.44073 0.73079 0.603 0.546448
## goout2medium 0.69593 0.71421 0.974 0.329855
## goout2often 2.10900 0.71571 2.947 0.003212 **
## goout2very often 2.37031 0.73186 3.239 0.001200 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 362.43 on 360 degrees of freedom
## AIC: 382.43
##
## Number of Fisher Scoring iterations: 4
According to the p-value and the positive estimate, male students have a significantly higher tendency towards high alcohol consumption (p-value = 0.0014). The number of absences was also found to be a significant predictor of high alcohol consumption (p-value = 0.0007). Meanwhile, students who spend 5 or more hours studying each week were found to be significantly less likely, at the 0.1 level, to have high alcohol consumption than students who study less than 2 hours/week (p-value = 0.062 for students who study 5-10 hours/week and 0.066 for those who study more than 10 hours/week). Students who go out often or very often are significantly more likely to have high alcohol consumption than students who go out very rarely (p-value = 0.003 for often and 0.001 for very often).
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.07801306 0.01489892 0.281652
## sexM 2.44691728 1.42235692 4.267696
## absences 1.08162325 1.03506042 1.134694
## studytime22-5 hours 0.67031109 0.36749404 1.221300
## studytime25-10 hours 0.40666321 0.15038986 1.014202
## studytime2>10 hours 0.30838487 0.07773944 0.998947
## goout2rarely 1.55384652 0.41639530 7.944878
## goout2medium 2.00557429 0.56021310 10.005175
## goout2often 8.23995705 2.30747880 41.326016
## goout2very often 10.70074528 2.88295245 54.872360
The table above presents the odds ratio for each variable in the model and its confidence interval.
Male students have 2.45 (CI = 1.42 - 4.27) times the odds of high alcohol consumption compared to female students. This finding supports our original hypothesis.
Each additional school absence multiplies the odds of high alcohol consumption by 1.08 (CI = 1.04 - 1.13). This finding also supports our original hypothesis.
We found only very weak evidence of a relationship between time spent studying and high alcohol consumption. The only study-time group whose confidence interval does not cross 1 is the group of students who study more than 10 hours a week. The students in this group have 3.24 (CI = 1.00 - 12.86) times lower odds of high alcohol consumption than students who study less than 2 hours/week; I flipped the odds ratio (1/x) for a more intuitive interpretation. Meanwhile, every other group in this variable has a confidence interval that crosses 1, which means that, from this model, it is unclear whether they are significantly less likely to have high alcohol consumption.
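Flipping an odds ratio below 1 with 1/x also swaps the endpoints of its confidence interval; a small sketch with the values from the studytime2 “>10 hours” row above:

```r
# invert an odds ratio below 1 for a more intuitive reading
or    <- 0.30838487  # OR for studying >10 hours/week
ci_lo <- 0.07773944
ci_hi <- 0.998947

round(1 / or, 2)               # 3.24
round(1 / c(ci_hi, ci_lo), 2)  # 1.00 and 12.86 (endpoints swap)
```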
Compared to students who very rarely go out, students who go out often have 8.24 (CI = 2.31 - 41.33) times the odds of high alcohol consumption, and students who go out very often have 10.7 (CI = 2.89 - 54.87) times the odds. Meanwhile, it is not apparent from the model whether students who go out moderately or rarely differ significantly from students who very rarely go out. In general, this finding is in line with our original hypothesis.
I will use all variables since they are all significant at least at 0.1 level.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to data
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 239 20
## TRUE 55 56
Among the 294 students predicted not to have high alcohol consumption, 239 are predicted correctly. Meanwhile, among the 76 students predicted to have high alcohol consumption, 56 are predicted correctly. Next, we will visualize this with a plot.
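From the 2x2 table above we can also read off the overall accuracy, sensitivity, and specificity of the classifier; a sketch using the printed counts:

```r
# counts from the confusion table above
tn <- 239; fp <- 20  # actual high_use = FALSE
fn <- 55;  tp <- 56  # actual high_use = TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # ~0.80
sensitivity <- tp / (tp + fn)                   # ~0.50: only half of high users are caught
specificity <- tn / (tn + fp)                   # ~0.92
```

The low sensitivity suggests that the 0.5 probability threshold misses many high users even though the overall accuracy looks decent.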
#count plot
p1 <- data %>%
ggplot(aes(x = prediction, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Model Prediction")
p1 <- p1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
p2 <- data %>%
ggplot(aes(x = prediction, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Model Prediction")
p1+p2
From the proportion plot, we can see that most of the students predicted to have high alcohol consumption do have high alcohol consumption in the data, and vice versa. Next, we will compute the total proportion of inaccurately classified individuals.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
train_er <- loss_func(class = data$high_use, prob = data$probability)
train_er
## [1] 0.2027027
The proportion of wrong predictions is 20.27 percent, which is a lot better than a simple guessing strategy (50 percent).
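A stricter baseline than a coin flip is to always predict the majority class (FALSE); its error rate is simply the share of high users in the data:

```r
# error rate of always predicting FALSE (the majority class)
n_high <- 111  # students with high_use == TRUE (55 + 56 from the table above)
n_all  <- 370

n_high / n_all  # 0.3
```

The model’s roughly 20 percent error rate also beats this 30 percent baseline.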
library(boot)
cv <- cv.glm(data = data, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
test_er <- cv$delta[1]
test_er
## [1] 0.2189189
The average proportion of wrong predictions in the cross-validation was around 22 percent, which is a little lower than that of the model in the Exercise set (26 percent).
# model with 3 variables
m3 <- glm(high_use ~ sex + absences + goout2, data = data, family = "binomial")
probabilities3 <- predict(m3, type = "response")
data <- mutate(data, probability3 = probabilities3)
data <- mutate(data, prediction3 = probabilities3 > 0.5)
train_er3 <- loss_func(class = data$high_use, prob = data$probability3)
cv3 <- cv.glm(data = data, cost = loss_func, glmfit = m3, K = 10)
test_er3 <- cv3$delta[1]
# model with 2 variables
m2 <- glm(high_use ~ sex + absences, data = data, family = "binomial")
probabilities2 <- predict(m2, type = "response")
data <- mutate(data, probability2 = probabilities2)
data <- mutate(data, prediction2 = probabilities2 > 0.5)
train_er2 <- loss_func(class = data$high_use, prob = data$probability2)
cv2 <- cv.glm(data = data, cost = loss_func, glmfit = m2, K = 10)
test_er2 <- cv2$delta[1]
# model with 1 variable
m1 <- glm(high_use ~ absences, data = data, family = "binomial")
probabilities1 <- predict(m1, type = "response")
data <- mutate(data, probability1 = probabilities1)
data <- mutate(data, prediction1 = probabilities1 > 0.5)
train_er1 <- loss_func(class = data$high_use, prob = data$probability1)
cv1 <- cv.glm(data = data, cost = loss_func, glmfit = m1, K = 10)
test_er1 <- cv1$delta[1]
#plotting
compare <- data.frame(model = c("4 Var", "3 Var", "2 Var", "1 Var",
"4 Var", "3 Var", "2 Var", "1 Var"),
error = c(train_er, train_er3, train_er2, train_er1,
test_er, test_er3, test_er2, test_er1),
type = c("Training Error", "Training Error",
"Training Error", "Training Error",
"Testing Error", "Testing Error",
"Testing Error", "Testing Error"))
compare %>%
ggplot(aes(x = model, y = error, group = type)) +
geom_line(aes(colour = type))+
scale_x_discrete(limits = rev) +
labs(x = "Number of variables",
y = "Error rate")+
geom_text(
label=round(compare$error, digits=2),
check_overlap = T)
Generally, both testing and training errors get higher as the number of variables in the model decreases. This is not surprising, as models with more (appropriate) variables tend to give better predictions. However, we see a dip in the testing error rate at three variables compared to the four-variable model. This may mean that the fourth variable did not improve the prediction much, so it may be better to use the three-variable model instead.
date()
## [1] "Tue Nov 29 02:56:59 2022"
# access the libraries
library(MASS)
# load the data
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This week we are working with the “Housing Values in Suburbs of Boston” data, which describes housing values in 506 suburbs of Boston (the observations) along with related variables. In total, the data has 14 variables, which include:
| Variable Name | Description |
|---|---|
| crim | per capita crime rate by town (numeric) |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft (numeric) |
| indus | proportion of non-retail business acres per town (numeric) |
| chas | Charles River dummy variable (integer, “1” if tract bounds river; “0” otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling (numeric) |
| age | proportion of owner-occupied units built prior to 1940 (numeric) |
| dis | weighted mean of distances to five Boston employment centres (numeric) |
| rad | index of accessibility to radial highways (integer) |
| tax | full-value property-tax rate per $10,000 (numeric) |
| ptratio | pupil-teacher ratio by town (numeric) |
| black | 1000(Bk − 0.63)², where Bk is the proportion of blacks by town (numeric) |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s (numeric) |
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The crime rate ranges from 0.006 to 88.98. The median (0.26) and the mean (3.6) are very different, and the maximum value (88.98) is also quite far from the third quartile (3.7). This means that most of the towns in Boston have a very low crime rate, but some towns have a very high crime rate and pull up the average.
Similar to the crime rate, the variable zn (proportion of residential land zoned for lots over 25,000 sq ft) also has a very skewed distribution. It ranges from 0 to 100; the median lies at 0 while the mean is 11.36. It seems that the proportion of zoned residential land is 0 for most of the towns, but some towns have a very high proportion.
Proportion of non-retail business acres per town (indus) ranges from 0.46 to 27.74 and averages at 11.14.
The variable chas (Charles River dummy variable) is categorical, so it has only two values: 0 and 1. The very low mean tells us that the value of this variable is 0 for most of the towns in this dataset.
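Because chas is a 0/1 dummy, its mean is simply the share of tracts that bound the river; a quick check:

```r
# the mean of a 0/1 dummy equals the proportion of ones
library(MASS)
data("Boston")

mean(Boston$chas)      # ~0.069
sum(Boston$chas == 1)  # 35 tracts bound the Charles River
```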
Nitrogen oxides concentration (nox) ranges from 0.385 to 0.871 and averages at 0.554.
The average number of rooms per dwelling (rm) ranges from 3.5 to 8.7 and averages at 6.3.
The proportion of owner-occupied units built prior to 1940 (age) ranges from 2.9 to 100 and averages at 68.57.
The weighted mean of distances to five Boston employment centres (dis) ranges from 1.13 to 12.13 and averages at 3.79.
Index of accessibility to radial highways (rad) ranges from 1 to 24 and averages at 9.5.
Full-value property-tax rate per $10,000 (tax) ranges from 187 to 711 and averages at 408.2.
The pupil-teacher ratio by town (ptratio) ranges from 12.6 to 22 and averages at 18.46.
The proportion of blacks (to be exact, it is 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town) ranges from 0.32 to 396.9 and averages at 356.67.
Lower status of the population (lstat) ranges from 1.73 to 37.97 and averages at 12.65.
The median value of owner-occupied homes in thousands of dollars (medv) ranges from 5 to 50 and averages at 22.53.
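As a quick sanity check on the chas interpretation above, tabulating the dummy variable shows how rare the value 1 is (a sketch, assuming Boston is loaded from MASS):

```r
library(MASS)
# most towns do not bound the Charles River, so chas is almost always 0
table(Boston$chas)
```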
I would use the ggpairs() function instead of the pairs() and corrplot() functions for a more compact result. I will also divide the variables between two plots for a clearer visualization. Since we will later use crime rate as the dependent variable, I will focus on it. Because we are using housing-related variables to classify crime rate, some of the interpretations may feel a little odd, and the variables themselves are complicated enough that they are hard to express in simpler words.
#accessing the libraries
library(ggplot2)
library(GGally)
ggpairs(Boston[1:7], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Confirming the observations we made in the previous section, we can see that the crime rate variable (crim) and the proportion of residential land (zn) have extremely skewed distributions. The scatter plots and correlation coefficients show a weak negative correlation (-0.2) between crime rate and the proportion of residential land (zn). A similar relationship can be seen between crime rate and the average number of rooms (rm). Meanwhile, crime rate has a positive correlation with the proportion of non-retail business acres per town (indus, coef = 0.407), nitrogen oxides concentration (nox, coef = 0.421), and the proportion of owner-occupied units built prior to 1940 (age, coef = 0.353).
ggpairs(Boston, columns = c("crim", "dis", "rad", "tax", "ptratio",
"black", "lstat", "medv"), mapping = aes(),
lower = list(combo = wrap("facethist", bins = 20)))
Crime rate is negatively correlated with the weighted mean of distances to five Boston employment centres (dis, coef = -0.38), the proportion of blacks (black, coef = -0.385), and the median value of owner-occupied homes (medv, coef = -0.388). Meanwhile, it has a positive relationship with the index of accessibility to radial highways (rad, coef = 0.626), the full-value property-tax rate (tax, coef = 0.583), the pupil-teacher ratio (ptratio, coef = 0.29), and the lower status of the population (lstat, coef = 0.456).
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We standardize the variables to put them all on the same scale. Now that we have standardized the dataset, every variable has a mean of 0. They should also have a standard deviation of 1.
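Both properties can be verified numerically; a minimal check on a fresh copy of the scaled data:

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))
# column means should be numerically zero and standard deviations exactly 1
range(colMeans(boston_scaled))
apply(boston_scaled, 2, sd)
```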
We will categorize crime rate into four categories: “low”, “med_low”, “med_high”, and “high”. We will use the quantiles as the break points for the categorization. Then, we will replace the original crime rate variable (crim) with the new categorized variable (crime).
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
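Since the break points are the quartiles, each of the four categories should contain roughly a quarter of the 506 towns; a quick tabulation confirms this (a sketch that redoes the cut on a fresh copy of the scaled data):

```r
library(MASS)
bs <- as.data.frame(scale(Boston))
bins <- quantile(bs$crim)
crime_check <- cut(bs$crim, breaks = bins, include.lowest = TRUE,
                   labels = c("low", "med_low", "med_high", "high"))
# four roughly equal-sized groups covering all 506 towns
table(crime_check)
```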
Now, we will divide the data into two parts: a train set and a test set. The train set contains 80% of the observations, while the test set contains the remaining 20%. The train set will be used to build our model and the test set to evaluate it.
library(dplyr)
library(MASS)
library(patchwork)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2698020 0.2326733 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 1.02479729 -0.9405142 -0.08304540 -0.8775489 0.4402499 -0.8596091
## med_low -0.08777476 -0.2897367 -0.01948777 -0.5770765 -0.1074116 -0.3918435
## med_high -0.36681929 0.2184969 0.10462734 0.4772700 0.1368950 0.4300311
## high -0.48724019 1.0172187 -0.10997442 1.0617826 -0.4826658 0.8148632
## dis rad tax ptratio black lstat
## low 0.9159415 -0.6859201 -0.7479900 -0.45291375 0.37809131 -0.77322131
## med_low 0.3261444 -0.5519863 -0.4863138 -0.04514452 0.31656076 -0.18201222
## med_high -0.3867928 -0.4198554 -0.3131006 -0.38387364 0.04801848 0.02281092
## high -0.8661575 1.6371072 1.5133254 0.77958792 -0.77376007 0.91567556
## medv
## low 0.52074099
## med_low 0.03573533
## med_high 0.16947542
## high -0.72308979
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08354228 0.68628331 -0.97216749
## indus 0.05438092 -0.34463038 0.24133768
## chas -0.12580476 -0.01056504 0.14255563
## nox 0.42432656 -0.83263416 -1.29005146
## rm -0.09256571 -0.18032511 -0.18708537
## age 0.15162290 -0.27561298 -0.29387696
## dis -0.04084488 -0.36879816 -0.02550853
## rad 3.37459640 0.98230172 -0.23168829
## tax 0.02050378 -0.04969296 0.62726128
## ptratio 0.11192561 0.02876548 -0.18175195
## black -0.11791282 0.03922974 0.12235154
## lstat 0.21939091 -0.19404543 0.35302481
## medv 0.18099343 -0.39363635 -0.13849704
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9484 0.0371 0.0146
For each observation in the dataset (each town in Boston), LDA computes the probability of belonging to each crime rate group. Each observation is then allocated to the group with the highest probability.
Prior probabilities of groups shows the proportion of each group in the data. Since we divided the data using the quantiles and the training set is chosen randomly, the prior probabilities of the groups should not differ much.
Group means show the average of each variable in each group. It is a little hard to interpret directly since the variables are already standardized, but we can see that some variables are showing a descending or ascending trend. For example, we can see that the average of rad variable (index of accessibility to radial highways) gets higher in groups with higher crime rate.
Coefficients of linear discriminants show the coefficients of the first, second, and third discriminant functions, which we use to discriminate between the groups.
Proportion of trace is the percentage of separation achieved by each discriminant function; the first discriminant function alone achieves about 95 percent of the separation.
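The allocation rule described above (assign each observation to the group with the highest posterior probability) can be illustrated on a built-in dataset; this is a standalone sketch using iris rather than our train set:

```r
library(MASS)
# fit LDA on iris and inspect the posterior probabilities
fit_demo <- lda(Species ~ ., data = iris)
post <- predict(fit_demo)$posterior
head(round(post, 3), 3)
# the predicted class is the column with the largest posterior probability
all(colnames(post)[max.col(post, ties.method = "first")] ==
      as.character(predict(fit_demo)$class))
```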
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The x axis shows the score from the first discriminant function (LD1) and the y axis the score from the second (LD2). There is some overlap between the groups. If we calculate the discriminant score of a town using the first discriminant function (LD1) and get a high score (above 4), that town would most probably be categorized as having a “high” crime rate. The arrows represent the predictor variables; a longer arrow indicates larger coefficients in the discriminant functions (the arrows are scaled by a factor of two so we can see them better). We can see that the rad variable (index of accessibility to radial highways) is the most discriminating variable, and it has positive coefficients in both the first (LD1) and the second (LD2) discriminant functions.
We will now predict the crime rate classification in our testing set with our LDA model. Then, we will compare the result to the real classification.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 11 2 0
## med_low 0 16 1 0
## med_high 0 16 14 2
## high 0 0 0 30
Here we have the cross tabulation of the real and predicted categories. Ideally, each observation should fall in the same group in both (towns with a low crime rate should be predicted to have a low crime rate). Most of the towns are classified correctly, but some are not. We can count the number of correctly classified observations by summing the diagonal (upper left to bottom right) of the cross tabulation; the observations outside the diagonal are misclassified.
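Following the diagonal-summing idea, the prediction accuracy can be computed directly from the table above (re-entering the printed counts by hand, since the random split changes between runs):

```r
# confusion matrix copied from the cross tabulation above
conf <- matrix(c(10, 11,  2,  0,
                  0, 16,  1,  0,
                  0, 16, 14,  2,
                  0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# correctly classified towns divided by all towns in the test set
sum(diag(conf)) / sum(conf)  # 70 / 102, about 0.69
```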
#reloading Boston dataset
data("Boston")
#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
We need to look for the point where the WCSS value changes most significantly. In this graph, the optimal number of clusters seems to be 2.
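One way to make "changes most significantly" concrete is to look at the relative drop in WCSS at each additional cluster (a sketch that recomputes the twcss values; the drop is typically largest when going from k = 1 to k = 2, matching the elbow in the plot):

```r
library(MASS)
bs <- as.data.frame(scale(Boston))
set.seed(123)
twcss_check <- sapply(1:10, function(k) kmeans(bs, k)$tot.withinss)
# proportional drop in total WCSS for each additional cluster
round(-diff(twcss_check) / head(twcss_check, -1), 2)
```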
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
We divided the data into two clusters, shown in different colors in the graph. For some variables the clusters are separated quite clearly, while for others the difference is less clear. For example, looking at the plots of the crim (crime rate) variable, it is quite clear that towns with a low crime rate are clustered together. Meanwhile, the clusters are less distinct for the dis (weighted mean of distances to five Boston employment centres) variable: although the red cluster tends toward the lower range, the separation is not very clear.
#reloading Boston dataset
data("Boston")
#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 4)
#extracting the result
boston_scaled <- cbind(boston_scaled, cluster = km$cluster)
# linear discriminant analysis
fit <- lda(cluster ~., data = boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)
# plot the lda results
plot(fit, dimen = 2, col = classes, pch = classes)
lda.arrows(fit, myscale = 3)
There is some overlap, but in general the data are separated quite clearly. The chas variable (Charles River dummy variable) seems to be the most influential separator among the variables in the dataset. Other influential separators are rad (index of accessibility to radial highways) and dis (weighted mean of distances to five Boston employment centres).
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
model_predictors <- dplyr::select(boston_scaled, -cluster)
# check the dimensions
dim(model_predictors)
## [1] 506 14
dim(fit$scaling)
## [1] 14 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% fit$scaling
matrix_product <- as.data.frame(matrix_product)
#plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = boston_scaled$cluster)
The two plots differ because the classifications differ. The first plot is colored by the crime rate category, while the second is colored by the k-means cluster. The separation in the second plot is much clearer, probably because the clusters were built from the data itself. However, in both plots the data seem to separate into two big groups.